Code
library(tidyverse)
library(DT)
library(plotly)
library(gridExtra)
pop <- read_csv("data/pop.csv")
hap <- read_csv("data/hapiscore_whr.csv")
pov <- read_csv("data/gm_365pov_rate.csv")Can Money Buy Happiness?
Country: Country
Year: Year
Poverty: The percentage of the population in poverty (0-100). Poverty here is defined as living off of less than $3.65 per day. This is not adjusted for inflation.
Happiness Score: Happiness as a percent indicated by WHR. The World Happiness Report (WHR) is scored from the national average response to the questions of life evaluations.
Observational Unit: Country and Year
Happiness and Poverty levels are linearly correlated. So we hypothesize that money can buy happiness.
We chose to drop all observations with na values, we may go back later and choose to do some data imputation to keep more data. However, we found ththat dropping na resulted in a loss of ~800 data points from the original 3200.
library(tidyverse)
library(DT)
library(plotly)
library(gridExtra)
pop <- read_csv("data/pop.csv")
hap <- read_csv("data/hapiscore_whr.csv")
pov <- read_csv("data/gm_365pov_rate.csv")clean_hap <- hap |>
pivot_longer(cols = `2005`:`2023`,
names_to = "Year",
values_to = "Happiness") |>
rename(Country = country) |>
drop_na()
clean_pov <- pov |>
select(`2005`:`2023`, country) |>
pivot_longer(cols = `2005`:`2023`,
names_to = "Year",
values_to = "Poverty") |>
rename(Country = country) |>
drop_na()
joined_pov_hap <- clean_hap |>
inner_join(clean_pov, by = join_by(Country == Country, Year == Year))
datatable(joined_pov_hap,
caption = htmltools::tags$caption(
style = 'caption-side: top; text-align: left; font-size: 24px;',
"Poverty and Happiness Data, Joined by Country and Year"
),
options = list(pageLength = 10)
)Displays the natural log of the ratio of the Happiness score and Poverty given year. Taking the natural log minimizes the impact for the skewness of the data, and accounts for severe outliers. The general trend of the ratio shows that either happiness increases over time or that poverty decreases:
# Create a new variable: Happiness/Poverty ratio
joined_pov_hap <- joined_pov_hap |>
mutate(Hap_Pov_Ratio = Happiness / Poverty) |>
mutate(Log_Hap_Pov_Ratio = log(Hap_Pov_Ratio))
# Plot boxplots of the ratio over time
ggplot(joined_pov_hap, aes(x = as.factor(Year), y = Log_Hap_Pov_Ratio)) +
#scale_y_continuous(limits = quantile(joined_pov_hap$Log_Hap_Pov_Ratio, c(.1, .9))) +
geom_boxplot(fill = "lightblue", alpha = 0.6, outliers = FALSE, outliers.shape = NA) + # Boxplots without outliers
#geom_jitter(width = 0.2, alpha = 0.3, color = "blue") + # Jitter for individual points
labs(title = "Distribution of the Natural Log of Happiness/Poverty Ratio Over Time",
x = "Year",
y = "Ln(Happiness / Poverty Ratio)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) Displays the relationship between the poverty rate and happiness score averaged over the years 2005 to 2023 where each point represents a different country. The general trend of points indicates that there is a negative relationship between poverty rate and happiness score; countries with higher poverty rates have lower happiness scores:
pov_hap_avg <- joined_pov_hap |>
group_by(Country) |>
summarize(Avg_Happiness = mean(Happiness, na.rm = TRUE),
Avg_Poverty = mean(Poverty, na.rm = TRUE), .groups = "drop")
# Scatterplot with one observation per country
scatter_plot <- ggplot(pov_hap_avg, aes(x = Avg_Poverty, y = Avg_Happiness)) +
geom_point(alpha = 0.6, color = "darkblue") +
# geom_smooth(method = "lm", color = "red", se = FALSE) +
labs(title = "Averaged Relationship Between Happiness and Poverty",
x = "Average Poverty Rate (%)",
y = "Average Happiness Score (%)") +
theme_minimal()
ggplotly(scatter_plot)Linear Regression is a supervised machine learning technique to learn the relationship between a dependent variable and one or more independent variables. Linear Regression assumes that there is some sort of linear relationship between the two variables. Linear Regression attempts to calculate the best Fit line, where the slope indicates the strength of the relationship between the two variables.
source: https://www.geeksforgeeks.org/ml-linear-regression/
model <- lm(Happiness ~ Poverty, data = joined_pov_hap) #https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/predict.lm
summary(model)
Call:
lm(formula = Happiness ~ Poverty, data = joined_pov_hap)
Residuals:
Min 1Q Median 3Q Max
-41.225 -4.914 0.145 5.281 20.105
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 63.712958 0.227018 280.65 <2e-16 ***
Poverty -0.239021 0.004437 -53.87 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.543 on 2340 degrees of freedom
Multiple R-squared: 0.5536, Adjusted R-squared: 0.5534
F-statistic: 2902 on 1 and 2340 DF, p-value: < 2.2e-16
# print(tibble(predict(model)))
pre_variance = joined_pov_hap |>
select(Happiness) |>
mutate(predicted_happiness = predict(model)) |> # https://www.math.ucla.edu/~anderson/rw1001/library/base/html/predict.lm.html#:~:text=lm%20produces%20predicted%20values%2C%20obtained,of%20the%20predictions%20are%20calculated.
mutate(residual = Happiness - predicted_happiness)
variances = pre_variance |> # https://www.marsja.se/variance-in-r-how-to-find-calculate/
mutate(happiness_variance = var(Happiness),
predicted_variance = var(predicted_happiness),
residual_variance = var(residual)) |>
select(happiness_variance, predicted_variance, residual_variance) |>
head(1)
knitr::kable(variances)| happiness_variance | predicted_variance | residual_variance |
|---|---|---|
| 127.3947 | 70.5234 | 56.87134 |
The linear regression model predicting Happiness from Poverty produced an R^2 value of 0.5536, meaning it explains ~55.4% of the variation in happiness.
set.seed(1901)
predicted_values <- predict(model)
simulated_happiness <- predicted_values + rnorm(length(predicted_values), mean = 0, sd = sigma(model))
simulated_data <- joined_pov_hap |>
mutate(Simulated_Happiness = simulated_happiness)
p1 <- ggplot(joined_pov_hap, aes(x = Poverty, y = Happiness)) +
geom_point(alpha = 0.6, color = "steelblue") +
labs(title = "Observed Happiness vs. Poverty", x = "Poverty", y = "Happiness") +
theme_minimal()
p2 <- ggplot(simulated_data, aes(x = Poverty, y = Simulated_Happiness)) +
geom_point(alpha = 0.6, color = "orange") +
labs(title = "Simulated Happiness vs. Poverty", x = "Poverty", y = "Simulated Happiness") +
theme_minimal()
interactive_plot1 <- ggplotly(p1)
interactive_plot2 <- ggplotly(p2)
subplot(interactive_plot1, interactive_plot2, nrows = 1, margin = 0.025)library(gridExtra)
#grid.arrange(p1, p2, ncol = 2)The plot on the left displays the observed relationship between happiness and poverty, showing a clear negative association—as poverty increases, happiness tends to decrease. The plot on the right shows simulated data generated from our model by taking predicted values from the regression and adding random error based on the residual standard error.
set.seed(1969)
num_simulations <- 1000
r_squared_values <- numeric(num_simulations)
for (i in 1:num_simulations) {
simulated_y <- predicted_values + rnorm(length(predicted_values), mean = 0, sd = sigma(model))
temp_model <- lm(simulated_y ~ joined_pov_hap$Poverty)
r_squared_values[i] <- summary(temp_model)$r.squared
}
r_squared_df <- tibble(R_Squared = r_squared_values)
r_squared_hist <- ggplot(r_squared_df, aes(x = R_Squared)) +
geom_histogram(binwidth = 0.02, fill = "purple", alpha = 0.7, color = "black") +
labs(title = "Distribution of R-Squared from Simulated Regressions",
x = "R-Squared", y = "Frequency") +
theme_minimal()
ggplotly(r_squared_hist)To evaluate the overall performance of our model, we conducted 1,000 simulations each time generating new simulated happiness scores using our model and calculating the R-squared valuefrom a regression of simulated data on poverty.
The histogram above shows the distribution of R-squared values from these simulations. We see that most R-squared values fall within a narrow band (e.g., around 0.54-0.57), consistent with our original model’s R-squared. This supports the idea that our model does consistendly explain a similar amount of variance in the data across different simulated datasets.
While completing our final project, we utilized options A, C, and D.
Option A: We collaborated over Github, which can be found at this link: https://github.com/spaykel/331finalproject
Option C: We formatted our tables nicely using DT and Kable.
Option D: We animated our plots using ggplotly. Hover over some of our plots to see more information on the data!